Semi-Supervised Domain Adaptation with 
Non-Parametric Copulas 



David Lopez-Paz 

MPI for Intelligent Systems 

dlopez@tue.mpg.de



Jose Miguel Hernandez-Lobato 

University of Cambridge 

jmh233@cam.ac.uk



Bernhard Schölkopf

MPI for Intelligent Systems 
bs@tue.mpg.de



Abstract 



A new framework based on the theory of copulas is proposed to address semi-supervised domain adaptation problems. The presented method factorizes any multivariate density into a product of marginal distributions and bivariate copula functions. Therefore, changes in each of these factors can be detected and corrected to adapt a density model across different learning domains. Importantly, we introduce a novel vine copula model, which allows for this factorization in a non-parametric manner. Experimental results on regression problems with real-world data illustrate the efficacy of the proposed approach when compared to state-of-the-art techniques.

1 Introduction 

When humans address a new learning problem, they often use knowledge acquired while learning 
different but related tasks in the past. For example, when learning a second language, people rely on 
grammar rules and word derivations from their mother tongue. This is called language transfer [19].
However, in machine learning, most of the traditional methods are not able to exploit similarities 
between different learning tasks. These techniques only achieve good performance when the data 
distribution is stable between training and test phases. When this is not the case, it is necessary to a) 
collect and label additional data and b) re-run the learning algorithm. However, these operations are 
not affordable in most practical scenarios. 

Domain adaptation, transfer learning or multitask learning frameworks [17, 2, 5, 13] confront these issues by first building a notion of task relatedness and second providing mechanisms to transfer knowledge between similar tasks. Generally, we are interested in improving predictive performance on a target task by using knowledge obtained when solving another related source task. Domain adaptation methods are concerned with what knowledge we can share between different tasks, how we can transfer this knowledge, and when we should or should not do so in order to avoid additional damage [4].

In this work, we study semi-supervised domain adaptation for regression tasks. In these problems, 
the object of interest (the mechanism that maps a set of inputs to a set of outputs) can be stated as 
a conditional density function. The data available for solving each learning task is assumed to be 
sampled from modified versions of a common multivariate distribution. Therefore, we are interested 
in sharing the "common pieces" of this generative model between tasks, and use the data from 
each individual task to detect, learn and adapt the varying parts of the model. To do so, we must 
find a decomposition of multivariate distributions into simpler building blocks that may be studied 
separately across different domains. The theory of copulas provides such representations [18].

Copulas are statistical tools that factorize multivariate distributions into the product of their marginals and a function that captures any possible form of dependence among them. This function is referred to as the copula, and it links the marginals together into the joint multivariate model. First introduced by Sklar [22], copulas have been successfully used in a wide range of applications, including finance, time series or natural phenomena modeling [12]. Recently, a new family of copulas named vines has gained interest in the statistics literature [1]. These are methods that factorize multivariate densities into a product of marginal distributions and bivariate copula functions. Each of these factors corresponds to one of the building blocks that we assume either constant or varying across different learning domains.

The contributions of this paper are two-fold. First, we propose a non-parametric vine copula model which can be used as a high-dimensional density estimator. Second, by making use of this method, we present a new framework to address semi-supervised domain adaptation problems, whose performance is validated in a series of experiments with real-world data and competing state-of-the-art techniques.

The rest of the paper is organized as follows: Section 2 provides a brief introduction to copulas, and describes a non-parametric estimator for the bivariate case. Section 3 introduces a novel non-parametric vine copula model, which is formed by the described bivariate non-parametric copulas. Section 4 describes a new framework to address semi-supervised domain adaptation problems using the proposed vine method. Finally, Section 5 describes a series of experiments that validate the proposed approach on regression problems with real-world data.

2 Copulas 

When the components of $\mathbf{x} = (x_1, \ldots, x_d)$ are jointly independent, their density function $p(\mathbf{x})$ can be written as

$$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i). \tag{1}$$

This equality does not hold when $x_1, \ldots, x_d$ are not independent. Nevertheless, the differences can be corrected if we multiply the right-hand side of (1) by a specific function that fully describes any possible dependence between $x_1, \ldots, x_d$. This function is called the copula of $p(\mathbf{x})$ [18] and satisfies

$$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i) \, \underbrace{c(P(x_1), \ldots, P(x_d))}_{\text{copula}}. \tag{2}$$

The copula $c$ is the joint density of $P(x_1), \ldots, P(x_d)$, where $P(x_i)$ is the marginal cdf of the random variable $x_i$. This density has uniform marginals, since $P(z) \sim U[0, 1]$ for any continuous random variable $z$. That is, when we apply the transformation $P(x_1), \ldots, P(x_d)$ to $x_1, \ldots, x_d$, we are eliminating all information about the marginal distributions. Therefore, the copula captures any distributional pattern that does not depend on their specific form, or, in other words, all the information regarding the dependencies between $x_1, \ldots, x_d$. When $P(x_1), \ldots, P(x_d)$ are continuous, the copula $c$ is unique [22]. However, infinitely many multivariate models share the same underlying copula function, as illustrated in Figure 1. The main advantage of copulas is that they allow us to model separately the marginal distributions and the dependencies linking them together to produce the multivariate model under study.

Given a sample from $p(\mathbf{x})$, we can estimate $p(\mathbf{x})$ as follows. First, we construct estimates of the marginal pdfs, $\hat{p}(x_1), \ldots, \hat{p}(x_d)$, which also provide estimates of the corresponding marginal cdfs, $\hat{P}(x_1), \ldots, \hat{P}(x_d)$. These cdf estimates are used to map the data to the $d$-dimensional unit hypercube. The transformed data are then used to obtain an estimate $\hat{c}$ for the copula of $p(\mathbf{x})$. Finally, (2) is approximated as

$$\hat{p}(\mathbf{x}) = \prod_{i=1}^{d} \hat{p}(x_i) \, \hat{c}(\hat{P}(x_1), \ldots, \hat{P}(x_d)). \tag{3}$$
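The estimation recipe above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes NumPy and SciPy, uses `scipy.stats.gaussian_kde` for the univariate marginals, and takes the copula density estimator as a caller-supplied function (the hypothetical argument `copula_density`). The independence copula, whose density is constant and equal to 1, is plugged in only to keep the example short.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_marginals(X):
    """One univariate Gaussian KDE per column of X (shape n x d)."""
    return [gaussian_kde(X[:, i]) for i in range(X.shape[1])]

def marginal_cdfs(kdes, X):
    """Estimated marginal cdfs: maps each row of X into the unit hypercube."""
    U = np.empty(X.shape, dtype=float)
    for i, kde in enumerate(kdes):
        U[:, i] = [kde.integrate_box_1d(-np.inf, x) for x in X[:, i]]
    return U

def density_estimate(kdes, copula_density, X):
    """Equation (3): product of marginal pdfs times the copula density."""
    marginals = np.prod([kde(X[:, i]) for i, kde in enumerate(kdes)], axis=0)
    return marginals * copula_density(marginal_cdfs(kdes, X))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # toy data with independent columns
kdes = fit_marginals(X)
U = marginal_cdfs(kdes, X)               # pseudo-sample in [0, 1]^2

def independence_copula(U):
    """Copula density of independent variables: identically 1."""
    return np.ones(len(U))

p_hat = density_estimate(kdes, independence_copula, X)
```

Any estimator of the copula density on the unit hypercube can replace `independence_copula` without touching the marginal models, which is the separation that the factorization in (2) provides.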

The estimation of marginal pdfs and cdfs can be implemented in a non-parametric manner by using unidimensional kernel density estimates. By contrast, it is common practice to assume a parametric model for the estimation of the copula function. Some examples of parametric copulas are Gaussian, Gumbel, Frank, Clayton or Student copulas [18]. Nevertheless, real-world data often exhibit complex dependencies which cannot be correctly described by these parametric copula models. This lack of flexibility of parametric copulas is illustrated in Figure 2. As an alternative, we propose to approximate the copula function in a non-parametric manner. Kernel density estimates can also be used to generate non-parametric approximations of copulas, as described in [8]. The following section reviews this method for the two-dimensional case.

Figure 1: Left, sample from a Gaussian copula with correlation $\rho = 0.8$. Middle and right, two samples drawn from multivariate models with this same copula but different marginal distributions, depicted as rug plots.

Figure 2: Left, sample from the copula linking variables 4 and 11 in the WIRELESS dataset. Middle, density estimate generated by a Gaussian copula model when fitted to the data. This technique is unable to capture the complex patterns present in the data. Right, copula density estimate generated by the non-parametric method described in Section 2.1.

2.1 Non-parametric Bivariate Copulas 

We now elaborate on how to non-parametrically estimate the copula of a given bivariate density $p(x, y)$. Recall that this density can be factorized as the product of its marginals and its copula

$$p(x, y) = p(x) \, p(y) \, c(P(x), P(y)). \tag{4}$$

Additionally, given a sample $\{(x_i, y_i)\}_{i=1}^{n}$ from $p(x, y)$, we can obtain a pseudo-sample from its copula $c$ by mapping each observation to the unit square using estimates of the marginal cdfs, namely

$$\{(u_i, v_i)\}_{i=1}^{n} := \{(\hat{P}(x_i), \hat{P}(y_i))\}_{i=1}^{n}. \tag{5}$$

These are approximate observations from the uniformly distributed random variables $u = P(x)$ and $v = P(y)$, whose joint density is the copula function $c(u, v)$. We could try to approximate this density function by placing Gaussian kernels on each observation $u_i$ and $v_i$. However, the resulting density estimate would have support on $\mathbb{R}^2$, while the support of $c$ is the unit square. A solution is to perform the density estimation in a transformed space. For this, we select some continuous distribution with support on $\mathbb{R}$, strictly positive density $\phi$, cumulative distribution $\Phi$ and quantile function $\Phi^{-1}$. Let $z$ and $w$ be two new random variables given by $z = \Phi^{-1}(u)$ and $w = \Phi^{-1}(v)$. Then, the joint density of $z$ and $w$ is

$$p(z, w) = \phi(z) \, \phi(w) \, c(\Phi(z), \Phi(w)). \tag{6}$$

The copula of this new density is identical to the copula of (4), since the performed transformations are marginal-wise. The support of (6) is now $\mathbb{R}^2$; therefore, we can approximate it with Gaussian kernels. Let $z_i = \Phi^{-1}(u_i)$ and $w_i = \Phi^{-1}(v_i)$. Then,

$$\hat{p}(z, w) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{N}(z, w \mid z_i, w_i, \Sigma), \tag{7}$$

where $\mathcal{N}(\cdot, \cdot \mid \nu_1, \nu_2, \Sigma)$ is a two-dimensional Gaussian density with mean $(\nu_1, \nu_2)$ and covariance matrix $\Sigma$. For convenience, we select $\phi$, $\Phi$ and $\Phi^{-1}$ to be the standard Gaussian pdf, cdf and quantile function, respectively. Finally, the copula density $c(u, v)$ is approximated by combining (6) with (7):

$$\hat{c}(u, v) = \frac{\hat{p}(\Phi^{-1}(u), \Phi^{-1}(v))}{\phi(\Phi^{-1}(u)) \, \phi(\Phi^{-1}(v))}. \tag{8}$$
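The bivariate estimator of this section can be sketched in a few lines of Python. The sketch makes two assumptions that are not taken from the text: the marginal cdfs are estimated with scaled ranks (`rankdata(x) / (n + 1)`) rather than kernel estimates, and the bandwidth matrix `Sigma` is a Silverman-style rescaling of the sample covariance in the transformed space. The function names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm, rankdata

def pseudo_sample(x, y):
    """Map observations to the unit square; ranks stand in for the cdf estimates."""
    n = len(x)
    return rankdata(x) / (n + 1), rankdata(y) / (n + 1)

def copula_density(u, v, u_i, v_i, Sigma):
    """Estimate c(u, v) at one point from the pseudo-sample (u_i, v_i)."""
    z, w = norm.ppf(u), norm.ppf(v)              # query point in Gaussian space
    z_i, w_i = norm.ppf(u_i), norm.ppf(v_i)      # kernel centres
    diffs = np.column_stack([z - z_i, w - w_i])
    kernel = multivariate_normal(mean=[0.0, 0.0], cov=Sigma)
    p_zw = kernel.pdf(diffs).mean()              # kernel estimate of p(z, w)
    return p_zw / (norm.pdf(z) * norm.pdf(w))    # divide out the Gaussian marginals

rng = np.random.default_rng(0)
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
u_i, v_i = pseudo_sample(xy[:, 0], xy[:, 1])
Z = np.column_stack([norm.ppf(u_i), norm.ppf(v_i)])
Sigma = len(Z) ** (-1.0 / 3.0) * np.cov(Z, rowvar=False)  # assumed bandwidth rule

c_centre = copula_density(0.5, 0.5, u_i, v_i, Sigma)
c_corner = copula_density(0.1, 0.9, u_i, v_i, Sigma)
```

For positively dependent data such as this toy sample, the estimate is much larger near the diagonal of the unit square (`c_centre`) than in the off-diagonal corner (`c_corner`).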

3 Regular Vines 

The method described above can be generalized to the estimation of copulas of more than two random variables. However, although kernel density estimates can be successful in spaces of one or two dimensions, as the number of variables increases, these methods start to be significantly affected by the curse of dimensionality and tend to overfit the training data. Additionally, for addressing domain adaptation problems, we are interested in factorizing these high-dimensional copulas into simpler building blocks transferable across learning domains. These two drawbacks can be addressed by recent methods in copula modelling called vines [1]. Vines decompose any high-dimensional copula density as a product of bivariate copula densities that can be approximated using the non-parametric model described above. These bivariate copulas (as well as the marginals) correspond to the simple building blocks that we plan to transfer from one learning domain to another. Different types of vines have been proposed in the literature. Some examples are canonical vines, D-vines or regular vines [16, 1]. In this work we focus on regular vines (R-vines) since they are the most general models.

An R-vine $\mathcal{V}$ for a probability density $p(x_1, \ldots, x_d)$ with variable set $V = \{1, \ldots, d\}$ is formed by a set of undirected trees $T_1, \ldots, T_{d-1}$, each of them with a corresponding set of nodes $V_i$ and set of edges $E_i$, where $V_i = E_{i-1}$ for $i \in [2, d-1]$. Any edge $e \in E_i$ has three associated sets $C(e), D(e), N(e) \subset V$ called the conditioned, conditioning and constraint sets of $e$, respectively. Initially, $T_1$ is inferred from a complete graph with a node associated with each element of $V$; for any $e \in T_1$ joining nodes $V_j$ and $V_k$, $C(e) = N(e) = \{V_j, V_k\}$ and $D(e) = \emptyset$. The trees $T_2, \ldots, T_{d-1}$ are constructed so that each $e \in E_i$ is formed by joining two edges $e_1, e_2 \in E_{i-1}$ which share a common node, for $i \geq 2$. The new edge $e$ has conditioned, conditioning and constraint sets given by $C(e) = N(e_1) \,\Delta\, N(e_2)$, $D(e) = N(e_1) \cap N(e_2)$, $N(e) = N(e_1) \cup N(e_2)$, where $\Delta$ is the symmetric difference operator. Figure 3 illustrates this procedure for an R-vine with 4 variables.
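The set bookkeeping in this construction maps directly onto Python's `frozenset` operators. The sketch below is illustrative only: the `Edge` class and the function names are hypothetical, and the check that the two joined edges share a common node is omitted for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    conditioned: frozenset    # C(e)
    conditioning: frozenset   # D(e)
    constraint: frozenset     # N(e)

def first_tree_edge(j, k):
    """Edge of T_1 joining variables j and k: C(e) = N(e) = {j, k}, D(e) empty."""
    pair = frozenset({j, k})
    return Edge(pair, frozenset(), pair)

def join_edges(e1, e2):
    """Edge of the next tree obtained by joining e1 and e2."""
    n1, n2 = e1.constraint, e2.constraint
    return Edge(conditioned=n1 ^ n2,    # symmetric difference
                conditioning=n1 & n2,   # intersection
                constraint=n1 | n2)     # union

# A four-variable example in which the first tree is the path 1 - 2 - 3 - 4.
e12, e23, e34 = first_tree_edge(1, 2), first_tree_edge(2, 3), first_tree_edge(3, 4)
e13_2 = join_edges(e12, e23)        # copula of x1, x3 given x2
e24_3 = join_edges(e23, e34)        # copula of x2, x4 given x3
e14_23 = join_edges(e13_2, e24_3)   # copula of x1, x4 given x2, x3
```

The path-shaped first tree is only one admissible choice; any spanning tree of the complete graph can serve as $T_1$.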

For any edge $e(j, k) \in T_i$, $i = 1, \ldots, d-1$, with conditioned set $C(e) = \{j, k\}$ and conditioning set $D(e)$, let $c_{jk|D(e)}$ be the value of the copula density for the conditional distribution of $x_j$ and $x_k$ when conditioning on $\{x_i : i \in D(e)\}$, that is,

$$c_{jk|D(e)} := c(P_{j|D(e)}, P_{k|D(e)} \mid x_i : i \in D(e)), \tag{9}$$

where $P_{j|D(e)} := P(x_j \mid x_i : i \in D(e))$ is the conditional cdf of $x_j$ when conditioning on $\{x_i : i \in D(e)\}$. Kurowicka and Cooke [16] indicate that any probability density function $p(x_1, \ldots, x_d)$ can then be factorized as

$$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i) \prod_{i=1}^{d-1} \prod_{e(j,k) \in E_i} c_{jk|D(e)}, \tag{10}$$

where $E_1, \ldots, E_{d-1}$ are the edge sets of the R-vine $\mathcal{V}$ for $p(x_1, \ldots, x_d)$. In particular, each of the edges in the trees from $\mathcal{V}$ specifies a different conditional copula density in (10). For $d$ variables, the density in (10) contains $d(d-1)/2$ bivariate copula factors in addition to the $d$ marginals. Changes in each of these factors can be detected and independently transferred across different learning domains to improve the estimation of the target density function.

The definition of $c_{jk|D(e)}$ in (9) requires the calculation of conditional marginal cdfs. For this, we use the following recursive identity introduced by Joe [14]:

$$P_{j|D(e)} = \frac{\partial \, C_{jk|D(e) \setminus k}}{\partial \, P_{k|D(e) \setminus k}}, \tag{11}$$

which holds for any $k \in D(e)$, where $D(e) \setminus k = \{i : i \in D(e) \wedge i \neq k\}$ and $C_{jk|D(e) \setminus k}$ is the cdf of $c_{jk|D(e) \setminus k}$.

Figure 3 (panels: Marginals, Tree 1, Tree 2, Tree 3): Example of the hierarchical construction of an R-vine copula for a system of four variables. The edges selected to form each tree are highlighted in bold. Conditioned and conditioning sets for each node and edge are shown as $C(e)|D(e)$. Later, each edge in bold will correspond to a different bivariate copula function.

One major advantage of vines is that they can model high-dimensional data by estimating density functions of only one or two random variables. For this reason, these techniques are significantly less affected by the curse of dimensionality than regular density estimators based on kernels, as we show in Section 5. So far, vines have generally been constructed using parametric models for the estimation of bivariate copulas. In the following, we describe a novel method for the construction of non-parametric regular vines.



3.1 Non-parametric Regular Vines 



In this section, we introduce a vine distribution in which all participant bivariate copulas can be estimated in a non-parametric manner. To do so, we model each of the copulas in (10) using the non-parametric method described in Section 2.1. Let $\{(u_i, v_i)\}_{i=1}^{n}$ be a sample from the copula density $c(u, v)$. The basic operation needed for the implementation of the proposed method is the evaluation of the conditional cdf $P(u|v)$ using the recursive equation (11). Define $w = \Phi^{-1}(v)$, $z_i = \Phi^{-1}(u_i)$ and $w_i = \Phi^{-1}(v_i)$. Combining (8) and (11) we obtain

$$
\begin{aligned}
\hat{P}(u|v) &= \int_0^u \hat{c}(x, v) \, dx = \frac{1}{n \phi(w)} \sum_{i=1}^{n} \int_0^u \frac{\mathcal{N}(\Phi^{-1}(x), w \mid z_i, w_i, \Sigma)}{\phi(\Phi^{-1}(x))} \, dx \\
&= \frac{1}{n \phi(w)} \sum_{i=1}^{n} \mathcal{N}(w \mid w_i, \sigma_w^2) \, \Phi\!\left( \frac{\Phi^{-1}(u) - \mu_{z_i|w_i}}{\sigma_{z_i|w_i}} \right),
\end{aligned}
\tag{12}
$$

where $\mathcal{N}(\cdot \mid \mu, \sigma^2)$ denotes a Gaussian density with mean $\mu$ and variance $\sigma^2$,

$$\Sigma = \begin{pmatrix} \sigma_z^2 & \sigma_z \sigma_w \gamma \\ \sigma_z \sigma_w \gamma & \sigma_w^2 \end{pmatrix}$$

is the kernel bandwidth matrix, $\mu_{z_i|w_i} = z_i + \frac{\sigma_z}{\sigma_w} \gamma (w - w_i)$ and $\sigma_{z_i|w_i}^2 = \sigma_z^2 (1 - \gamma^2)$.

Equation (12) can be used to approximate any conditional cdf $P_{j|D(e)}$. For this, we use the fact that $P(x_j \mid x_i : i \in D(e)) = P(u_j \mid u_i : i \in D(e))$, where $u_i = P(x_i)$, for $i = 1, \ldots, d$, and recursively apply rule (11) using equation (12) to compute $P(u_j \mid u_i : i \in D(e))$.
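The closed-form conditional cdf can be written down compactly in Python. The sketch below follows the Gaussian-kernel expression term by term; the pseudo-sample construction with scaled ranks and the Silverman-style bandwidth are assumptions made for the example, and `conditional_cdf` is an illustrative name.

```python
import numpy as np
from scipy.stats import norm, rankdata

def conditional_cdf(u, v, u_i, v_i, sigma_z, sigma_w, gamma):
    """Kernel estimate of P(u | v) from the pseudo-sample (u_i, v_i)."""
    w = norm.ppf(v)
    z_i, w_i = norm.ppf(u_i), norm.ppf(v_i)
    mu = z_i + (sigma_z / sigma_w) * gamma * (w - w_i)   # conditional kernel means
    sd = sigma_z * np.sqrt(1.0 - gamma ** 2)             # conditional kernel std
    weights = norm.pdf(w, loc=w_i, scale=sigma_w)        # N(w | w_i, sigma_w^2)
    terms = weights * norm.cdf((norm.ppf(u) - mu) / sd)
    return terms.sum() / (len(u_i) * norm.pdf(w))

rng = np.random.default_rng(1)
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
n = len(xy)
u_i, v_i = rankdata(xy[:, 0]) / (n + 1), rankdata(xy[:, 1]) / (n + 1)

# Bandwidth matrix entries (assumed Silverman-style rule in the transformed space).
S = n ** (-1.0 / 3.0) * np.cov(norm.ppf(u_i), norm.ppf(v_i))
s_z, s_w = np.sqrt(S[0, 0]), np.sqrt(S[1, 1])
g = S[0, 1] / (s_z * s_w)

low = conditional_cdf(0.5, 0.9, u_i, v_i, s_z, s_w, g)    # P(u <= 0.5 | v = 0.9)
high = conditional_cdf(0.5, 0.1, u_i, v_i, s_z, s_w, g)   # P(u <= 0.5 | v = 0.1)
curve = [conditional_cdf(u, 0.5, u_i, v_i, s_z, s_w, g) for u in (0.1, 0.5, 0.9)]
```

With positively dependent data, conditioning on a large `v` shifts mass towards large `u`, so `low` is smaller than `high`. The estimate is non-decreasing in `u`, but because the kernel estimate of the marginal of $w$ is not exactly $\phi$, its limit as `u` approaches 1 need not equal 1 exactly.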
To complete the inference recipe for the non-parametric regular vine, we must specify how to construct the hierarchy of trees $T_1, \ldots, T_{d-1}$. In other words, we must define a procedure to select the edges (bivariate copulas) that will form each tree. We have a total of $d(d-1)/2$ bivariate copulas which should be distributed among the different trees. Ideally, we would like to include in the first trees of the hierarchy the copulas with the strongest dependence level. This will allow us to prune the model by assuming independence in the last $k < d$ trees, since the density function for the independent copula is constant and equal to 1. To construct the trees $T_1, \ldots, T_{d-1}$, we assign a weight to each edge $e(j, k)$ (copula) according to the level of dependence between the random variables $x_j$ and $x_k$. A common practice is to fix this weight to the empirical estimate of Kendall's $\tau$ for the two random variables under consideration.¹ Given these weights for each edge, we propose to solve the edge selection problem by obtaining $d - 1$ maximum spanning trees. Prim's algorithm [20] can be used to solve this problem efficiently.
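The selection of the first tree can be sketched as follows. This is a minimal illustration under stated assumptions: the edge weight is taken to be the absolute value of the empirical Kendall's $\tau$ (so that strong negative dependence also counts as strong), Prim's algorithm is written for a dense weight matrix without a priority queue, and only $T_1$ is built, since later trees require conditional pseudo-observations.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_weights(X):
    """Symmetric matrix of |Kendall's tau| for every pair of columns of X."""
    d = X.shape[1]
    W = np.zeros((d, d))
    for j in range(d):
        for k in range(j + 1, d):
            tau, _ = kendalltau(X[:, j], X[:, k])
            W[j, k] = W[k, j] = abs(tau)
    return W

def maximum_spanning_tree(W):
    """Prim's algorithm on a dense weight matrix; returns d - 1 edges (j, k)."""
    d = W.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        # Heaviest edge that leaves the current tree.
        j, k = max(((a, b) for a in in_tree for b in range(d) if b not in in_tree),
                   key=lambda e: W[e])
        edges.append((j, k))
        in_tree.add(k)
    return edges

rng = np.random.default_rng(0)
a, b = rng.normal(size=300), rng.normal(size=300)
X = np.column_stack([a, a + 0.1 * rng.normal(size=300),
                     b, b + 0.1 * rng.normal(size=300)])
tree = maximum_spanning_tree(kendall_weights(X))
```

In this toy sample, columns 0 and 1 are strongly dependent, as are columns 2 and 3, so both pairs end up as edges of the tree, linked by a single weak edge.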



4 Domain Adaptation with Regular Vines 

In this section we describe how regular vines can be used to address domain adaptation problems in the non-linear regression setting with continuous data. The proposed approach could be easily extended to other problems such as density estimation or classification. In regression problems, we are interested in inferring the mapping mechanism or conditional distribution with density $p(y|\mathbf{x})$ that maps one feature vector $\mathbf{x} = (x_1, \ldots, x_d) \in \mathbb{R}^d$ into a target scalar value $y \in \mathbb{R}$. Rephrased into the copula framework, this conditional density can be expressed as

$$p(y|\mathbf{x}) \propto p(y) \prod_{i=1}^{d} \prod_{e(j,k) \in E_i} c_{jk|D(e)}, \tag{13}$$

where $E_1, \ldots, E_d$ are the edge sets of an R-vine for $p(\mathbf{x}, y)$. Note that the normalization of the right-hand side of (13) is relatively easy since $y$ is scalar.
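Because $y$ is one-dimensional, the normalization constant can be obtained by simple numerical quadrature. The sketch below assumes that the unnormalized right-hand side is available as a Python function of $y$ for a fixed $\mathbf{x}$; the grid, the trapezoid rule and the function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def trapezoid(values, grid):
    """Trapezoid-rule integral of values sampled on grid."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

def normalize_on_grid(unnormalized, y_grid):
    """Turn values proportional to p(y | x) into a density on y_grid."""
    values = np.asarray([unnormalized(y) for y in y_grid], dtype=float)
    return values / trapezoid(values, y_grid)

# Toy example: an unnormalized standard Gaussian in y.
y_grid = np.linspace(-8.0, 8.0, 2001)
density = normalize_on_grid(lambda y: 3.0 * np.exp(-0.5 * y ** 2), y_grid)
area = trapezoid(density, y_grid)
```

Once normalized, the gridded density can be summarized by any point prediction of interest, for example its mean or mode.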



In the classic domain adaptation setup we usually have large amounts of data for solving a source task characterized by the density function $p_s(\mathbf{x}, y)$. However, only a partial or reduced sample is available for solving a target task with density $p_t(\mathbf{x}, y)$. Given the data available for both tasks, our objective is to build a good estimate for the conditional density $p_t(y|\mathbf{x})$. To address this domain adaptation problem, we assume that $p_t$ is a modified version of $p_s$. In particular, we assume that $p_t$ is obtained in two steps from $p_s$. First, $p_s$ is expressed using an R-vine representation as in (10), and second, some of the factors included in that representation (marginal distributions or pairwise copulas) are modified to derive $p_t$. All we need to address the adaptation across domains is to reconstruct the R-vine representation of $p_s$ using data from the source task, and then identify which of the factors have been modified to produce $p_t$. These factors are corrected using data from the target task. In the following, we describe how to identify and correct these modified factors.

Marginal distributions can change between source and target tasks (also known as covariate shift). In this case, $P_s(x_i) \neq P_t(x_i)$ for some $i \in \{1, \ldots, d\}$, or $P_s(y) \neq P_t(y)$, and we need to re-generate the estimates of the affected marginals using data from the target task. Additionally, some of the bivariate copulas $c_{jk|D(e)}$ may differ from source to target tasks. In this case, we also re-estimate the affected copulas using data from the target task. Simultaneous changes in both copulas and marginals can occur. However, there is no limitation in updating each of the modified components separately. Finally, if some of the factors remain constant across domains, we can use the available data from the target task to improve the estimates obtained using only the data from the source task. Note that we are addressing a more general problem than covariate shift. Besides identifying and correcting changes in marginal distributions, we also consider changes in any possible form of dependence (conditional distributions) between random variables.

For the implementation of the strategy mentioned above, we need to identify whether or not two samples come from the same distribution. For this, we propose to use the non-parametric two-sample test Maximum Mean Discrepancy (MMD) [10]. MMD will return low p-values when two samples are unlikely to have been drawn from the same distribution. Specifically, given samples from two distributions $P$ and $Q$, MMD will determine $P \neq Q$ if the distance between the embeddings of the empirical distributions for these two samples in an RKHS is significantly large.
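A small Python sketch of this detection step is given below. It is not the test configuration used in the paper: it assumes a Gaussian RBF kernel with a median-heuristic bandwidth, the biased MMD² estimate, and a permutation approximation of the null distribution. The helper `training_data_for_factor` and its threshold `alpha` are hypothetical names that illustrate the update rule of this section: re-estimate a factor from target data when a change is detected, and pool source and target data otherwise.

```python
import numpy as np

def mmd2(K, n):
    """Biased MMD^2 from the Gram matrix of the pooled sample (first n rows: X)."""
    return K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean()

def mmd_pvalue(X, Y, n_perm=500, seed=0):
    """Permutation p-value for the hypothesis that X and Y share a distribution."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n = len(X)
    norms = np.sum(Z ** 2, axis=1)
    sq = np.maximum(norms[:, None] + norms[None, :] - 2.0 * Z @ Z.T, 0.0)
    bandwidth2 = 0.5 * np.median(sq[sq > 0])      # median heuristic (assumption)
    K = np.exp(-sq / (2.0 * bandwidth2))
    observed = mmd2(K, n)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        count += int(mmd2(K[np.ix_(idx, idx)], n) >= observed)
    return (count + 1) / (n_perm + 1)

def training_data_for_factor(source, target, alpha=0.05):
    """Target data only if a change is detected, otherwise the pooled sample."""
    if mmd_pvalue(source, target) < alpha:
        return target
    return np.vstack([source, target])

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(100, 1))
shifted = rng.normal(2.0, 1.0, size=(100, 1))     # marginal shifted by two std
p_shift = mmd_pvalue(source, shifted, n_perm=200)
chosen = training_data_for_factor(source, shifted)
```

For a marginal, `source` and `target` hold the one-dimensional samples of that variable; for a bivariate copula, they would hold the corresponding two-column pseudo-samples.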



¹We have tried more general dependence measures such as the HSIC (Hilbert-Schmidt Independence Criterion) without observing gains that justify the increase in computational cost.
Table 1: Average test log-likelihood (TLL) obtained by NPRV, GRV and KDE on six different UCI datasets.

Dataset            Auto          Cloud         Housing       Magic         Page-Blocks   Wireless
No. of variables   8             10            14            11            10            11
KDE                1.32 ± 0.06   3.25 ± 0.10   1.96 ± 0.17   1.13 ± 0.11   1.90 ± 0.13   0.98 ± 0.06
GRV                1.84 ± 0.08   5.00 ± 0.12   1.68 ± 0.11   2.09 ± 0.08   4.69 ± 0.20   0.36 ± 0.08
NPRV               2.07 ± 0.07   4.54 ± 0.13   3.18 ± 0.17   2.72 ± 0.17   5.64 ± 0.14   2.17 ± 0.13



Semi-supervised and unsupervised domain adaptation: The proposed approach can be easily extended to take advantage of additional unlabeled data to improve the estimation of our model. Specifically, extra unlabeled target task data can be used to refine the factors in the R-vine decomposition of $p_t$ which do not depend on $y$. This is still valid even in the limiting case of not having access to labeled data from the target task at training time (unsupervised domain adaptation).

5 Experiments 

To validate the proposed method, we run two series of experiments using real-world data. The first series illustrates the accuracy of the density estimates generated by the proposed non-parametric vine method. The second series validates the effectiveness of the proposed framework for domain adaptation problems in the non-linear regression setting. In all experiments, kernel bandwidth matrices are selected using Silverman's rule-of-thumb. For comparative purposes, we include the results of different state-of-the-art domain adaptation methods whose parameters are selected by a 10-fold cross-validation process on the training data.

Approximations: A complete R-vine requires the use of conditional copula functions, which are challenging to learn. A common approximation is to ignore any dependence between the copula functional form and its set of conditioning variables. Note that the arguments of the copula functions remain conditional cdfs. Moreover, to avoid excessive computational costs, we consider only the first tree ($d - 1$ copulas) of the R-vine, which is the one containing the largest amount of dependence between the distribution variables. Increasing the number of considered trees did not lead to significant performance improvements.

5.1 Accuracy of Non-parametric Regular Vines for Density Estimation 

The density estimates generated by the new non-parametric R-vine method (NPRV) are evaluated on data from six normalized UCI datasets [9]. We compare against a standard density estimator based on Gaussian kernels (KDE), and a parametric vine method based on bivariate Gaussian copulas (GRV). From each dataset, we extract 50 random samples of size 1000. Training is performed using 30% of each random sample. Average test log-likelihoods and corresponding standard deviations on the remaining 70% of the random sample are summarized in Table 1 for each technique. In these experiments, NPRV obtains the highest average test log-likelihood in all cases except one (Cloud), where it is outperformed by GRV. KDE shows the worst performance in four of the six datasets, which we attribute to its direct exposure to the curse of dimensionality.
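The evaluation protocol for the KDE baseline can be sketched as below. The sketch assumes `scipy.stats.gaussian_kde` with its built-in `"silverman"` bandwidth rule as a stand-in for the baseline, and uses synthetic Gaussian data in place of a UCI dataset; the function name and the toy sample are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def average_test_log_likelihood(sample, train_fraction=0.3, seed=0):
    """Fit a KDE on a random 30% of the sample, score it on the remaining 70%."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(sample))
    n_train = int(train_fraction * len(sample))
    train, test = sample[idx[:n_train]], sample[idx[n_train:]]
    kde = gaussian_kde(train.T, bw_method="silverman")   # expects shape (d, n)
    return float(np.mean(kde.logpdf(test.T)))

rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 3))     # stand-in for one random sample of size 1000
tll = average_test_log_likelihood(sample)
```

Repeating the call over 50 random samples and reporting the mean and standard deviation of `tll` would mirror the layout of Table 1.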

5.2 Comparison with other Domain Adaptation Methods 

NPRV is analyzed in a series of experiments for domain adaptation in the non-linear regression setting with real-world data. Detailed descriptions of the six selected UCI datasets and their domains are available in the supplementary material. The proposed technique is compared with different benchmark methods. The first two, GP-Source and GP-All, are considered baselines. They are two Gaussian process (GP) methods, the first one trained only with data from the source task, and the second one trained with the normalized union of data from both source and target problems. The other five methods are considered state-of-the-art domain adaptation techniques. Daume performs a feature augmentation such that the kernel function evaluated at two points from the same domain is twice as large as when these two points come from different domains. SSL-Daume [6] is a semi-supervised (SSL) extension of Daume which takes into account unlabeled data from the target domain. ATGP [4] models the source and target task data using a single GP, but learns additional kernel parameters to correlate input vectors between domains. This method outperforms others like the one proposed by Bonilla et al. [3]. KMM [11] minimizes the distance between the marginal distributions of the source and target domains by matching their means when mapped into a universal RKHS. Finally, KuLSIF [13] operates in a similar way to KMM. Besides NPRV, we also include in the experiments its fully unsupervised variant, UNPRV, which ignores any labeled data from the target task.

Table 2: Average NMSE and standard deviation for all algorithms and UCI datasets. Entries marked [illegible] could not be recovered from the source.

Dataset            Wine          Sarcos        Rocks-Mines   Hill-Valleys  Axis-Slice    Isolet
No. of variables   12            21            60            100           386           617
GP-Source          [illegible]   [illegible]   [illegible]   1.00 ± 0.00   [illegible]   [illegible]
GP-All             [illegible]   1.69 ± 0.04   1.10 ± 0.08   0.87 ± 0.06   1.27 ± 0.07   [illegible]
Daume              0.97 ± 0.03   0.88 ± 0.02   0.72 ± 0.09   0.99 ± 0.03   0.95 ± 0.02   0.99 ± 0.00
SSL-Daume          0.82 ± 0.05   0.74 ± 0.08   0.59 ± 0.07   0.82 ± 0.07   0.65 ± 0.04   0.64 ± 0.02
ATGP               0.86 ± 0.08   0.79 ± 0.07   0.56 ± 0.10   0.15 ± 0.07   1.00 ± 0.01   1.00 ± 0.00
KMM                1.03 ± 0.01   1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00
KuLSIF             0.91 ± 0.08   1.67 ± 0.06   0.65 ± 0.10   0.80 ± 0.11   0.98 ± 0.07   0.58 ± 0.02
NPRV               0.73 ± 0.07   0.61 ± 0.10   0.72 ± 0.13   0.15 ± 0.07   0.38 ± 0.07   0.46 ± 0.09
UNPRV              0.76 ± 0.06   0.62 ± 0.13   0.72 ± 0.15   0.19 ± 0.09   0.37 ± 0.07   0.42 ± 0.04
Av. Ch. Mar.       10            1             38            100           226           89
Av. Ch. Cop.       5             8             49            34            155           474

For training, we randomly sample 1000 data points for both source and target tasks, where all the data in the source task and 5% of the data in the target task are labeled. The test set contains 1000 points from the target task. Table 2 summarizes the average test normalized mean square error (NMSE) and corresponding standard deviation for each method in each dataset across 30 random repetitions of the experiment. The proposed methods obtain the best results in 5 out of 6 cases. Notably, UNPRV (Unsupervised NPRV), which ignores labeled data from the target task, also outperforms the other benchmark methods in most cases. Finally, the two bottom rows in Table 2 show the average number of marginals and bivariate copulas, respectively, which are updated in each dataset during the execution of NPRV.
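For reference, a minimal Python sketch of the error measure follows. The normalization is not spelled out in the text, so this sketch assumes the common convention of dividing the mean squared error by the variance of the test targets, under which a constant prediction at the target mean scores exactly 1.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Mean squared error divided by the variance of the true targets (assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

y = np.array([1.0, 2.0, 4.0, 7.0])
baseline = nmse(y, np.full_like(y, y.mean()))   # constant mean prediction
perfect = nmse(y, y)                            # exact predictions
```

Under this convention, values below 1 indicate a predictor that improves on the constant mean prediction.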

Computational Costs: Running NPRV requires filling in a weight matrix of size $O(d^2)$ with the empirical estimates of Kendall's $\tau$ for any two random variables. The computation of each of these estimates can be done efficiently with cost $O(n \log n)$, where $n$ is the number of available data points. Therefore, the final training cost of NPRV is $O(d^2 n \log n)$. In practice, we obtain competitive training times. Training NPRV for the Isolet dataset took about 3 minutes on a regular laptop computer. Predictions made by a single-level NPRV have cost $O(nd)$. Parametric copulas may be used to reduce the computational demands.

6 Conclusions 

We have proposed a novel non-parametric domain adaptation strategy based on copulas. The new 
approach works by decomposing any multivariate density into a product of marginal densities and 
bivariate copula functions. Changes in these factors across different domains can be detected using
two-sample tests, and transferred across domains in order to adapt the target task density model.
A novel non-parametric vine method has been introduced for the practical implementation of this 
method. This technique leads to better density estimates than standard parametric vines or KDE, and 
is also able to outperform a large number of alternative domain adaptation methods in a collection 
of regression problems with real-world data. 